A Linear Memory CTC-Based Algorithm for Text-to-Voice Alignment of Very Long Audio Recordings

Authors

Abstract

Synchronisation of a voice recording with the corresponding text is a common task in speech and music processing, and is used in many practical applications (automatic subtitling, audio indexing, etc.). A common approach derives a mid-level feature from the audio and finds its alignment to the text by maximizing a similarity measure via Dynamic Time Warping (DTW). Recently, a Connectionist Temporal Classification (CTC) approach was proposed that directly emits character probabilities and uses them to find the optimal text-to-voice alignment. While this method yields promising results, the memory complexity of the alignment search remains quadratic in the input lengths, which limits its application to relatively short recordings. In this work, we describe how recent improvements to the textbook DTW algorithm can be adapted to the CTC context to achieve linear memory complexity. We then detail our overall solution and demonstrate that it can align text to several hours of audio with a mean alignment error of 50 ms for speech and 120 ms for singing voice, which corresponds to a median alignment error below 50 ms for both voice types. Finally, we evaluate its robustness to transcription errors and to different languages.
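The abstract states the quadratic-versus-linear memory point in a single sentence. The minimal Python sketch below illustrates it under stated assumptions. It treats CTC forced alignment as a Viterbi search over the blank-interleaved label sequence. `viterbi_quadratic` is the textbook version: its T × S backpointer table grows with the product of the input lengths, with T frames and S = 2L + 1 states for L characters. `viterbi_linear` reaches the same optimal score with a Hirschberg-style divide-and-conquer. It keeps only O(S) scores per pass, pins the best state at the middle frame, and then re-solves the two halves. This is not necessarily the adaptation the authors describe, since the abstract only says that recent DTW improvements were carried over to CTC. All function names, the blank index 0, and the input layout are hypothetical. The input `logp[t][v]` is assumed to hold per-frame log-probabilities from some CTC model.

```python
import math
from typing import Dict, List, Sequence, Tuple

NEG = -math.inf
BLANK = 0  # assumption: class 0 is the CTC blank symbol

def extend(labels: Sequence[int]) -> List[int]:
    """Interleave blanks: [a, b] -> [blank, a, blank, b, blank]."""
    z = [BLANK]
    for c in labels:
        z += [c, BLANK]
    return z

def can_step(z: List[int], p: int, s: int) -> bool:
    """CTC topology: stay, advance one state, or skip a blank between distinct labels."""
    if s in (p, p + 1):
        return True
    return s == p + 2 and z[s] != BLANK and z[s] != z[p]

def _init_row(logp, z) -> List[float]:
    """First frame: a path may start on the leading blank or on the first label."""
    row = [NEG] * len(z)
    for s in range(min(2, len(z))):
        row[s] = logp[0][z[s]]
    return row

def _end_state(row: List[float]) -> int:
    """A path must end on the last label or on the trailing blank."""
    S = len(row)
    return S - 1 if S == 1 or row[S - 1] >= row[S - 2] else S - 2

def viterbi_quadratic(logp, labels) -> Tuple[float, List[int]]:
    """Reference Viterbi: stores a T x S backpointer table (quadratic memory)."""
    z = extend(labels)
    row, back = _init_row(logp, z), []
    for t in range(1, len(logp)):
        bp = [max((p for p in (s, s - 1, s - 2) if p >= 0 and can_step(z, p, s)),
                  key=lambda p: row[p]) for s in range(len(z))]
        row = [row[bp[s]] + logp[t][z[s]] for s in range(len(z))]
        back.append(bp)
    s = _end_state(row)
    score, path = row[s], [s]
    for bp in reversed(back):  # backtrack through the stored table
        s = bp[s]
        path.append(s)
    return score, path[::-1]

def _forward(logp, z, t0, s0, t1, s_hi) -> Dict[int, float]:
    """Best scores from pinned (t0, s0) to each state s0..s_hi at frame t1."""
    f = {s: NEG for s in range(s0, s_hi + 1)}
    f[s0] = 0.0
    for t in range(t0 + 1, t1 + 1):
        f = {s: logp[t][z[s]] + max(f[p] for p in (s, s - 1, s - 2)
                                    if p >= s0 and can_step(z, p, s)) for s in f}
    return f

def _backward(logp, z, t0, s_lo, t1, s1) -> Dict[int, float]:
    """Best scores from each state s_lo..s1 at frame t0 to pinned (t1, s1)."""
    g = {s: NEG for s in range(s_lo, s1 + 1)}
    g[s1] = 0.0
    for t in range(t1 - 1, t0 - 1, -1):
        g = {s: max(logp[t + 1][z[n]] + g[n] for n in (s, s + 1, s + 2)
                    if n <= s1 and can_step(z, s, n)) for s in g}
    return g

def _solve(logp, z, t0, s0, t1, s1) -> List[int]:
    """Hirschberg-style recursion: split time in half, pin the best middle state."""
    if t1 - t0 <= 1:
        return [s0] if t1 == t0 else [s0, s1]
    mid = (t0 + t1) // 2
    f = _forward(logp, z, t0, s0, mid, s1)
    g = _backward(logp, z, mid, s0, t1, s1)
    s_mid = max(f, key=lambda s: f[s] + g[s])
    return _solve(logp, z, t0, s0, mid, s_mid) + _solve(logp, z, mid, s_mid, t1, s1)[1:]

def viterbi_linear(logp, labels) -> Tuple[float, List[int]]:
    """Same optimum as viterbi_quadratic, keeping only O(S) scores per pass."""
    z = extend(labels)
    T, S = len(logp), len(z)
    row = _init_row(logp, z)
    for t in range(1, T):  # unpinned forward pass: best score and end state
        row = [logp[t][z[s]] + max(row[p] for p in (s, s - 1, s - 2)
                                   if p >= 0 and can_step(z, p, s)) for s in range(S)]
    s_end = _end_state(row)
    g = _backward(logp, z, 0, 0, T - 1, s_end)  # choose the best start state (0 or 1)
    s_start = max((s for s in (0, 1) if s <= s_end), key=lambda s: logp[0][z[s]] + g[s])
    return row[s_end], _solve(logp, z, 0, s_start, T - 1, s_end)
```

CTC paths never move backwards through the label sequence. For that reason the sketch limits each sub-problem to the states between its pinned endpoints, which keeps the cost of recomputing the halves modest. It also assumes the recording has enough frames to emit the whole transcript. If it does not, the score is `-inf` and the returned path is meaningless.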

Similar Resources

A recursive algorithm for the forced alignment of very long audio segments

In this paper we address the problem of aligning very long (often more than one hour) audio files to their corresponding textual transcripts in an effective manner. We present an efficient recursive technique to solve this problem that works well even on noisy speech signals. The key idea of this algorithm is to turn the forced alignment problem into a recursive speech recognition problem with ...

A New Type-II Fuzzy Logic Based Controller for Non-Linear Dynamical Systems with Application to 3-PSP Parallel Robot

Type-II fuzzy logic has shown its superiority over traditional fuzzy logic when dealing with uncertainty. Type-II fuzzy logic controllers are, however, newer and more promising approaches that have recently been applied to various fields because of their significant contribution, especially when noise (an important instance of uncertainty) emerges. During the design of type- i fuz...

Accurate Audio-to-Score Alignment for Expressive Violin Recordings

An audio-to-score alignment system adaptive to various playing styles and techniques, and also with high accuracy for onset/offset annotation is the key step toward advanced research on automatic music expression analysis. Technical barriers include the processing of overlapped notes, repeated note sequences, and silence. Most of these characteristics vary with expressions. In this paper, the a...

Audio-to-text alignment for speech recognition with very limited resources

In this paper we present our efforts in building a speech recognizer constrained by the availability of very limited resources. We consider that neither proper training databases nor initial acoustic models are available for the target language. Moreover, for the experiments shown here, we use grapheme-based speech recognizers. Most prior work in the area use initial acoustic models, trained on...

Low-Delay Singing Voice Alignment to Text

In this paper we present some ideas and preliminary results on how to move phoneme recognition techniques from speech to the singing voice to solve the low-delay alignment problem. The work focuses mainly on searching for the most appropriate Hidden Markov Model (HMM) architecture and suitable input features for the singing voice, and on reducing the delay of the phonetic aligner without reducing its ac...

Journal

Journal title: Applied Sciences

Year: 2023

ISSN: 2076-3417

DOI: https://doi.org/10.3390/app13031854